XPath

XPath, short for XML Path Language, is a query language for selecting nodes in an XML document. It is very useful in web scraping because parsers such as lxml turn an HTML document into the same kind of node tree, so the same queries work on HTML.


In [1]:
import requests
from lxml import html

In [2]:
%%HTML
<html>
  <body>
    <h1>Favorite Python Librarires</h1>
    <ul>
      <li>Numpy</li>
      <li>Pandas</li>
      <li>requests</li>
    </ul>
  </body>
</html>


Favorite Python Librarires

  • Numpy
  • Pandas
  • requests

Load HTML Code

First I'll read the source of cell number 2 from the notebook's input history and store it in html_code. Then we will parse that string into an lxml element object.


In [3]:
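# In[2] is the source of cell 2 as this IPython session recorded it.
html_code = In[2]
# Drop the 42-character prefix get_ipython().run_cell_magic('HTML', '', '
# and the trailing ') and turn escaped "\n" sequences into real newlines.
html_code = html_code[42:-2].replace("\\n","\n")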
print(html_code)

doc = html.fromstring(html_code)


<html>
  <body>
    <h1>Favorite Python Librarires</h1>
    <ul>
      <li>Numpy</li>
      <li>Pandas</li>
      <li>requests</li>
    </ul>
  </body>
</html>
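
Reading the markup back out of In[2] depends on how the notebook session records its cell history. A simpler route, sketched here on the assumption that you are free to keep the markup in a Python string, is to parse that string directly. The name SAMPLE_HTML is made up for this example; the markup is copied from cell 2 as-is.

```python
from lxml import html

# SAMPLE_HTML is a hypothetical name; the content mirrors cell 2.
SAMPLE_HTML = """<html>
  <body>
    <h1>Favorite Python Librarires</h1>
    <ul>
      <li>Numpy</li>
      <li>Pandas</li>
      <li>requests</li>
    </ul>
  </body>
</html>"""

# fromstring returns the root element of the parsed document.
doc = html.fromstring(SAMPLE_HTML)
print(doc.tag)
```

This avoids the slice offsets entirely, at the cost of not rendering the HTML in the notebook.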

Using xpath to find nodes in a document

There are many ways to find the node you are interested in within an XML or HTML document. The first is to write the whole path from the root, with the steps separated by forward slashes /

Reading the <h1> tag


In [4]:
title = doc.xpath("/html/body/h1")[0]
title


Out[4]:
<Element h1 at 0x7f447cafa458>

To read the text inside that tag you can use the element's text attribute.


In [5]:
title.text


Out[5]:
'Favorite Python Librarires'

Another way to read the text is to use the text() function in XPath.


In [6]:
title = doc.xpath("/html/body/h1/text()")[0]
title


Out[6]:
'Favorite Python Librarires'
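
The two approaches agree here because the <h1> contains only text. When a tag has child elements they can differ. The sketch below uses a made-up heading with a nested <em> to show what each approach returns; text_content() is a method that lxml.html elements provide, and string() is a standard XPath function.

```python
from lxml import html

# Made-up markup with a nested element inside the heading.
doc = html.fromstring(
    "<html><body><h1>Favorite <em>Python</em> Libraries</h1></body></html>"
)
h1 = doc.xpath("/html/body/h1")[0]

# .text holds only the text before the first child element.
before_child = h1.text

# text() selects each text node that is a direct child of <h1>.
direct_text_nodes = h1.xpath("text()")

# text_content() and string() both join all text below the element.
all_text = h1.text_content()
all_text_xpath = h1.xpath("string()")

print(repr(before_child))
print(direct_text_nodes)
print(repr(all_text))
```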

Working with multiple items

For expressions that select nodes, xpath returns a list. If there are no matches, it returns an empty list; if there is one match, it returns a list with one item.


In [7]:
item_list = doc.xpath("/html/body/ul/li")
item_list


Out[7]:
[<Element li at 0x7f447cafa9a8>,
 <Element li at 0x7f447cafaae8>,
 <Element li at 0x7f447cafab38>]
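
Because a missing node gives an empty list, indexing with [0] raises IndexError when nothing matches. A small helper can make that case explicit. This is a sketch only; first_or_default is a hypothetical name, not part of lxml. The last line shows that an expression which does not select nodes, such as count(), returns a number instead of a list.

```python
from lxml import html

doc = html.fromstring(
    "<html><body><ul><li>Numpy</li><li>Pandas</li></ul></body></html>"
)

def first_or_default(node, expression, default=None):
    """Return the first XPath match, or `default` when nothing matches."""
    matches = node.xpath(expression)
    return matches[0] if matches else default

found = first_or_default(doc, "/html/body/ul/li/text()")
missing = first_or_default(doc, "/html/body/h2/text()", default="(no match)")

# count() is not a node-selecting expression, so the result is a float.
how_many = doc.xpath("count(//li)")

print(found, missing, how_many)
```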

We can also use the text() function when the path matches multiple items; it returns the text of each one.


In [8]:
doc = html.fromstring(html_code)
item_list = doc.xpath("/html/body/ul/li/text()")
item_list


Out[8]:
['Numpy', 'Pandas', 'requests']

Tag selector without full path

You can select every node in the document that matches a selector, wherever it sits in the tree, by starting the expression with a double forward slash // instead of the full path.


In [9]:
doc = html.fromstring(html_code)
item_list = doc.xpath("//li/text()")
item_list


Out[9]:
['Numpy', 'Pandas', 'requests']
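
One detail worth knowing when you call xpath on an element other than the root: an expression that starts with // still searches the whole document. To search only below the current element, start with .// instead. The markup below is made up for illustration.

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<ul><li>one</li><li>two</li></ul>"
    "<ul><li>three</li></ul>"
    "</body></html>"
)
first_list = doc.xpath("//ul")[0]

# Leading // starts from the document root, even when called on first_list.
whole_document = first_list.xpath("//li/text()")

# Leading .// restricts the search to descendants of first_list.
inside_first_list = first_list.xpath(".//li/text()")

print(whole_document)     # ['one', 'two', 'three']
print(inside_first_list)  # ['one', 'two']
```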

Selecting one result

You can select one result from a list by adding [index] after a tag selector. Put the index on the tag step (for example li[1]), not on a function such as text().

Note: XPath indexes start from 1, not 0.


In [10]:
doc = html.fromstring(html_code)
item_list = doc.xpath("/html/body/ul/li[1]/text()")
item_list


Out[10]:
['Numpy']
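
The index applies to the step it is attached to, which matters once // is involved. In the sketch below (made-up markup with two lists), //li[1] means "every <li> that is the first <li> of its parent", while wrapping the path in parentheses picks the first match overall. last() is a standard XPath function that gives the final position.

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<ul><li>one</li><li>two</li></ul>"
    "<ul><li>three</li><li>four</li></ul>"
    "</body></html>"
)

# First <li> within each parent: one result per list.
first_in_each = doc.xpath("//li[1]/text()")

# Parentheses collect all matches first, then [1] picks the first overall.
first_overall = doc.xpath("(//li)[1]/text()")

# last() selects the final <li> of the first <ul>.
last_in_first_list = doc.xpath("/html/body/ul[1]/li[last()]/text()")

print(first_in_each)       # ['one', 'three']
print(first_overall)       # ['one']
print(last_in_first_list)  # ['two']
```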

In [11]:
%%HTML
<html>
  <body>
    <h1 class="text-muted">Favorite Python Librarires</h1>
    <ul class="nav nav-pills nav-stacked">
      <li role="presentation"><a href="http://www.numpy.org/">Numpy</a></li>
      <li role="presentation"><a href="http://pandas.pydata.org/">Pandas</a></li>
      <li role="presentation"><a href="http://python-requests.org/">requests</a></li>
    </ul>
    <h1 class="text-success">Favorite JS Librarires</h1>
    <ul class="nav nav-tabs">
      <li role="presentation"><a href="http://getbootstrap.com/">Bootstrap</a></li>
      <li role="presentation"><a href="https://jquery.com/">jQuery</a></li>
      <li role="presentation"><a href="http://d3js.org/">d3.js</a></li>
    </ul>
</html>


Favorite Python Librarires

Favorite JS Librarires


In [12]:
# Same extraction as in cell 3, applied to the source of cell 11.
html_code = In[11]
html_code = html_code[42:-2].replace("\\n","\n")
print(html_code)

doc = html.fromstring(html_code)


<html>
  <body>
    <h1 class="text-muted">Favorite Python Librarires</h1>
    <ul class="nav nav-pills nav-stacked">
      <li role="presentation"><a href="http://www.numpy.org/">Numpy</a></li>
      <li role="presentation"><a href="http://pandas.pydata.org/">Pandas</a></li>
      <li role="presentation"><a href="http://python-requests.org/">requests</a></li>
    </ul>
    <h1 class="text-success">Favorite JS Librarires</h1>
    <ul class="nav nav-tabs">
      <li role="presentation"><a href="http://getbootstrap.com/">Bootstrap</a></li>
      <li role="presentation"><a href="https://jquery.com/">jQuery</a></li>
      <li role="presentation"><a href="http://d3js.org/">d3.js</a></li>
    </ul>
</html>

Attributes selector

In this example we have two <h1> tags with different CSS classes. Because each <h1> has a single class, we can select one by comparing its whole class attribute:


In [13]:
title = doc.xpath("/html/body/h1[@class='text-muted']/text()")[0]
title


Out[13]:
'Favorite Python Librarires'

contains() function

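I want to select all items in the first list. I could match the full class attribute, or I could use the contains() function with one of the classes that appears only on the first list. Keep in mind that contains() is a substring test, so choose a value that is not part of another class name.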


In [14]:
item_list = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/text()")
item_list


Out[14]:
['Numpy', 'Pandas', 'requests']
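
Since contains() is a plain substring test, a short class name can match more than intended. The sketch below, on made-up markup, compares an exact attribute match, a contains() match, and a commonly used XPath 1.0 idiom that pads the attribute with spaces so that only a whole class token matches.

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<ul class='nav nav-stacked'><li>a</li></ul>"
    "<ul class='navbar'><li>b</li></ul>"
    "</body></html>"
)

# Exact comparison: no <ul> has class exactly equal to 'nav'.
exact = doc.xpath("//ul[@class='nav']/li/text()")

# Substring test: 'nav' also occurs inside 'navbar'.
loose = doc.xpath("//ul[contains(@class,'nav')]/li/text()")

# Token test: surround the attribute and the class name with spaces.
token = doc.xpath(
    "//ul[contains(concat(' ', normalize-space(@class), ' '), ' nav ')]"
    "/li/text()"
)

print(exact)  # []
print(loose)  # ['a', 'b']
print(token)  # ['a']
```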

Returning attributes

What if we want to read the href attribute of each <a> tag to get the link? Put @href as the last step of the path:


In [15]:
item_list = doc.xpath("/html/body/ul[contains(@class,'nav-stacked')]/li/a/@href")
item_list


Out[15]:
['http://www.numpy.org/',
 'http://pandas.pydata.org/',
 'http://python-requests.org/']
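
If you need each link together with its label, one option is to select the <a> elements themselves and read both values from each element. This is a sketch on a shortened copy of the list above; get() is the standard lxml element method for reading an attribute.

```python
from lxml import html

doc = html.fromstring(
    "<html><body><ul>"
    '<li><a href="http://www.numpy.org/">Numpy</a></li>'
    '<li><a href="http://pandas.pydata.org/">Pandas</a></li>'
    "</ul></body></html>"
)

# Map each link's text to its href by iterating over the <a> elements.
links = {a.text: a.get("href") for a in doc.xpath("//ul/li/a")}

for label, url in links.items():
    print(label, url)
```

Selecting the elements once keeps each label and URL paired, which two separate text() and @href queries would not guarantee if some <a> lacked an href.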

Real world example

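Read the list of languages with 1M+ articles from http://www.wikipedia.org/. The class names in the XPath below match the page's markup at the time this notebook was run and may have changed since.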


In [16]:
response = requests.get("http://www.wikipedia.org")
doc = html.fromstring(response.content, parser=html.HTMLParser(encoding="utf-8"))

In [17]:
lang_list = doc.xpath("//div[@class='langlist langlist-large hlist'][1]/ul/li/a/text()")
lang_list


Out[17]:
['Deutsch',
 'English',
 'Español',
 'Français',
 'Italiano',
 'Nederlands',
 'Polski',
 'Русский',
 'Sinugboanong Binisaya',
 'Svenska',
 'Tiếng Việt',
 'Winaray']
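
Since a live page can change or be unreachable, it can help to keep the parsing step separate from the download so the XPath can be checked offline. The sketch below assumes a hypothetical helper named link_texts and a small made-up page; it is not Wikipedia's actual markup. It reuses the same HTMLParser(encoding="utf-8") setup as cell 16.

```python
from lxml import html

def link_texts(content, expression):
    """Parse UTF-8 encoded HTML bytes and return the matched text values."""
    parser = html.HTMLParser(encoding="utf-8")
    doc = html.fromstring(content, parser=parser)
    return [str(text) for text in doc.xpath(expression)]

# Made-up stand-in for a downloaded page, encoded the way a server would send it.
sample_page = (
    "<html><body><div class='langlist'><ul>"
    "<li><a href='//es.wikipedia.org/'>Español</a></li>"
    "<li><a href='//ru.wikipedia.org/'>Русский</a></li>"
    "</ul></div></body></html>"
).encode("utf-8")

languages = link_texts(
    sample_page, "//div[contains(@class,'langlist')]/ul/li/a/text()"
)
print(languages)
```

With a live page, the bytes would come from response.content, ideally after calling response.raise_for_status() so that an error page is not parsed by mistake.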
